{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Domain Background"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lung cancer is the second most common cancer in both men and women that afflicts 225,500 people a year in the United States. About 1 out of 4 cancer deaths are from lung cancer, more than colon, breast, and prostate cancers combined [1]. Early detection of the cancer can allow for early treatment which significantly increases the chances of survival [2]. \n",
    "\n",
    "Lung cancer screening is performed with a CT scan that collects hundreds of images to build a full 3D image of the lung. Next, small growths called pulmonary nodules need to be detected. These nodules show up as small, circular structures on the CT scans. \n",
    "\n",
    "\n",
    "![cigs-on-screening-ct.jpg](cigs-on-screening-ct.jpg)\n",
    "\n",
    "\n",
    "\n",
    "In some cases, the nodules are not obvious and may take a trained eye and considerable amount of time to detect. ***Building a machine learning algorithm that can automatically detect the nodules can save considerable time and money.*** \n",
    "\n",
    "![weak_data2.png](weak_data2.png)\n",
    "\n",
    "Additionally, most pulmonary nodules are not cancerous as they can also be due to non-cancerous growths, scar tissue, or infections [1].  The task is then to determine the features of a nodule that are associated with malignancy. Current state of the art methods yields a 25% false positive rate in CT lung cancer screenings [4]. ***A convolutional neural network may be used to determine the features associated with cancerous or non-cancerous pulmonary nodules, and may reduce the false positive rate of CT lung cancer screenings.*** "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Problem Statement"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a convolutional neural network to detect the probability lesions in the lungs are cancerous. The neural network should be trained to maximize that probability when it is and minimize it when it doesn't. This optimization function can be defined by the log loss.  \n",
    "\n",
    "- **Task:** Analyze CT scan images of lungs to find nodules which are defined as a discrete, well-marginated, rounded opacity less than or equal to 3cm in diameter that is completely surround by lung parenchyma, does not touch the hilum or mediastinum [3]. Then create a classifier to calculate the probability a nodule is cancerous.\n",
    "\n",
    "![lungmembranes1332515803416.png](lungmembranes1332515803416.png)\n",
    "\n",
    "- **Training set:** 3D CT scan slice images with a label for each patient as cancer or non-cancer. 1595 patient samples.\n",
    "- **Performance:** Maximize the probability that the cancer is present when it is and minimize the probability when it isn't present\n",
    "- **Target function:** Log Loss:\n",
    "\n",
    "\n",
    "$$LogLoss=-\\frac{1}{n}\\sum_{i=1}^{n}[y_{i}\\log{(\\hat{y_{i}})}+(1−y_{i})\\log{(1−\\hat{y_{i}})}],$$\n",
    "\n",
    "where\n",
    "\n",
    "• $n$ is the number of patients in the test set\n",
    "\n",
    "• $\\hat{y_{i}}$ is the predicted probability of the image belonging to a patient with cancer\n",
    "\n",
    "• $y_{i}$ is 1 if the diagnosis is cancer, 0 otherwise\n",
    "\n",
    "• $log()$ is the natural (base e) logarithm\n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Datasets and Inputs\n",
    "\n",
    "The dataset includes CT scan images of 1,595 patients collected from the Kaggle Data Science Bowl 2017 [5]. For each patient, there are 100-150 CT scan slices, forming a 3D composite of the lung. The samples are labeled as either cancerous or non-cancerous.\n",
    "\n",
    "CT scan works by using x-rays to image structures deep within the tissue with high spatial resolution. The x-rays can also detect the type of substance it transmits and is quantitatively expressed as a unit of radiodensity, called the Hounsfield Unit (HU). For example, air has very high transmittance and has a radiodensity of -1000HU. Soft tissue has a radio density of 100-300HU. Bone has a radiodensity between 700-3000HU. Most of the space within the lung consists of air. Nodules consist of spherical tissues within the lung with a radiodensity between 100-300HU. \n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Solution Statement\n",
    "A deep-learning algorithm with a convolutional neural network can first detect structures that resemble nodules. This can be done by using pre-defined convolutional filters that approximate the shape of nodules. These filters can be created by averaging the shape of X amount of known nodules. These convolutional filters can be then scanned throughout the entire lung to determine the positions likely for the shape to be present. \n",
    "\n",
    "Next, the nodules need to be determined if they are cancerous or not. A convolutional neural network may be used to determine the shape features associated with cancerous or non-cancerous by training the convolutional network.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmark Model\n",
    "Currently, the best known method to determine if a nodule is cancerous or non-cancerous is with the Solitary Pulmonary Nodule Malignant Risk Calculator. The calculator is based off a large scale data study that calculated the statistical risk of malignancy of a nodule based on a variety of measured attributes. \n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Evaluation Metrics\n",
    "By using this calculator, the probability of a nodule being cancerous can be determined, and this can be entered into the log loss function for the dataset. The log loss score with this calculator sets the benchmark to outperform with the proposed methods. \n",
    "\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Project Design\n",
    "\n",
    "\n",
    "**Process data**\n",
    "\n",
    "The dataset consists of 66GB of Dicom files, the standard for medical images. These need to be processed to its ndarrays to be inputted into tensorflow. The following tasks for processing need to be performed:\n",
    "\n",
    "\t1. Convert dicom to ndarray\n",
    "\t2. Resample image so a pixel represents a fixed unit length of the lung, eg: 1px=1mm. This is important since each image is inconsistent and this can pose issues with the convolutional model.\n",
    "\t3. Mask the lung to minimize the search space of the convolutional filters. This can greatly speed up the training since only a small proportion of the image consists of the lung.\n",
    "\t4. Regularize the data by centering it to zero and dividing by the mean. \n",
    "\n",
    "**Defining nodule filter:**\n",
    "\t1. Take 10 samples labeled as cancerous and find the nodule by manually extracting it by eye. Overlap the nodules and normalize their radii. Average them. Then create N number of convolutional filters with varying sizes.\n",
    "\t\n",
    "**Extracting the nodules from samples:**\n",
    "\t1. Input processed data into a 1-layer convolutional network and run through all the pre-defined nodule filters. Use either a softmax or a ReLU function as the activation function. For positions that cross X threshold of the activation function, capture the input at that position and output it to an isolated ndarray.\n",
    "\t\n",
    "**Classifying nodules as cancerous vs non-cancerous**\n",
    "\t1.   Input nodules into a convolutional neural network with two to three layers and an output layer with two neurons, cancerous and non-cancerous. The weights of the network can be trained with gradient descent or adam optimizer. Ideally, the neural network will determine the features in the shapes that are associated with cancerous and non-cancerous.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# References\n",
    "1. https://www.cancer.org/cancer/lung-cancer/prevention-and-early-detection/exams-and-tests.html\n",
    "2. http://www.mayoclinic.org/diseases-conditions/lung-cancer/basics/tests-diagnosis/con-20025531\n",
    "3. http://www.radiologyassistant.nl/en/p460f9fcd50637/solitary-pulmonary-nodule-benign-versus-malignant.html\n",
    "4. https://biometry.nci.nih.gov/cdas/approved-projects/531/\n",
    "5. https://www.kaggle.com/c/data-science-bowl-2017 "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
